Exploratory Data Analysis

Authors: Ling Lu, Luoyan Zhang, Yinuo Wang
Affiliation: Boston University
Published: April 21, 2025

Introduction

This section presents a detailed data analysis of job market trends in 2024, focusing on AI-driven changes, salary disparities, and employment trends across different regions and industries.

Data Import and Cleaning

# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio

# Load dataset

df = pd.read_csv("lightcast_job_postings.csv")

# Display dataset summary

df.info()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72498 entries, 0 to 72497
Columns: 118 entries, id to naics_2022_6_name
dtypes: float64(38), object(80)
memory usage: 65.3+ MB
[df.describe() output: summary statistics for the 38 numeric columns (8 rows × 38 columns). Notable figures: salary is reported for 30,808 postings, with a mean of about $117,954, a median of $116,300, and a range of $15,860 to $500,000; the median posting duration is 18 days, and the median minimum experience required is 5 years.]

Data Cleaning & Preprocessing

Drop Unnecessary Columns

Which columns should be dropped, and why?

The columns selected for removal are considered redundant because they either provide duplicate information, are unnecessary for analysis, or have more detailed equivalents in the dataset. For example, "ID" serves as a unique identifier but is often not needed for analysis, while "URL" and "ACTIVE_URLS" contain job posting links that are useful externally but not critical for data processing. Similarly, "LAST_UPDATED_TIMESTAMP" is dropped because "LAST_UPDATED_DATE" already provides update information in a more readable format. The "DUPLICATES" column, which likely flags repeated entries, is also removed since duplicates can be handled separately.

Additionally, industry and occupational classification columns like "NAICS2" to "NAICS6" and "SOC_2", "SOC_3", "SOC_5" are removed because these represent different levels of classification, and more relevant or updated versions (e.g., "NAICS_2022_2" to "NAICS_2022_6") are already present in the dataset. Removing these redundant columns helps streamline the dataset, making it more efficient to analyze without losing valuable information.

columns_to_drop = [
    "id", "duplicates", "last_updated_timestamp",
    "naics2", "naics3", "naics4", "naics5", "naics6",
    "soc_2", "soc_3", "soc_5"
]

df = df.drop(columns=[col for col in columns_to_drop if col in df.columns])

print("Columns after dropping:", df.columns)
df.head()
Columns after dropping: Index(['last_updated_date', 'posted', 'expired', 'duration', 'title_raw',
       'body', 'modeled_expired', 'modeled_duration', 'company',
       'company_name',
       ...
       'naics_2022_2', 'naics_2022_2_name', 'naics_2022_3',
       'naics_2022_3_name', 'naics_2022_4', 'naics_2022_4_name',
       'naics_2022_5', 'naics_2022_5_name', 'naics_2022_6',
       'naics_2022_6_name'],
      dtype='object', length=107)
[df.head() output: the first five postings across the remaining 107 columns, from last_updated_date through naics_2022_6_name. Examples include a Murphy USA Enterprise Analyst posting (Retail Trade), an SMX Oracle Consultant posting (Temporary Help Services), a Sedgwick Data Analyst posting (Claims Adjusting), a Wells Fargo data management role (Commercial Banking), and one Unclassified posting.]

5 rows × 107 columns

Handle Missing Values

How should missing values be handled?

Missing values should be handled strategically based on their impact on analysis. First, visualizing missing data with a heatmap helps identify patterns and assess severity. Columns with more than 50% missing values are dropped to avoid unreliable or incomplete data. For numerical fields like "Salary", filling missing values with the median ensures the data remains representative without being skewed by outliers. Categorical fields like "Industry" are filled with "Unknown" to maintain completeness while preserving interpretability. This approach balances data retention and accuracy, ensuring meaningful analysis without introducing bias.

import missingno as msno
import matplotlib.pyplot as plt

# Identify columns with >10% missing values
missing_threshold = 0.1
missing_cols = df.columns[df.isnull().mean() > missing_threshold]

# Visualize missingness patterns for those columns only
msno.heatmap(df[missing_cols], figsize=(12, 6))
plt.title("Missing Values Heatmap (Filtered)")
plt.show()

# Drop columns with >50% missing values
df.dropna(thresh=len(df) * 0.5, axis=1, inplace=True)

# "salary" exceeds the 50% threshold and is dropped above, but it is
# central to the analysis, so restore it from the raw file
df_original = pd.read_csv("lightcast_job_postings.csv")
df["salary"] = df_original["salary"]

# Fill missing salaries with the median (robust to outliers);
# assignment avoids the chained-assignment inplace pattern deprecated
# ahead of pandas 3.0
if "salary" in df.columns:
    df["salary"] = df["salary"].fillna(df["salary"].median())
else:
    print("Warning: No salary-related column found!")

# Fill missing industry names with "Unknown"
df["naics6_name"] = df["naics6_name"].fillna("Unknown")

Remove Duplicates

To ensure each job is counted only once, we remove duplicates based on job title, company, location, and posting date.

print("Existing columns in DataFrame:", df.columns.tolist())  # Display actual column names

# Convert column names to lowercase for case-insensitive matching
df.columns = df.columns.str.lower()

columns_to_check = ["title", "company", "location", "posted"]
existing_columns = [col for col in columns_to_check if col in df.columns]

if not existing_columns:
    raise ValueError("None of the specified columns exist in the DataFrame. Check column names!")

print("Before removing duplicates:")
print(df[existing_columns].head())

df = df.drop_duplicates(subset=existing_columns, keep="first")

print("\nAfter removing duplicates:")
print(df[existing_columns].head())

print("\nDuplicates removed based on:", existing_columns)
Existing columns in DataFrame: ['last_updated_date', 'posted', 'expired', 'duration', 'title_raw', 'body', 'modeled_expired', 'modeled_duration', 'company', 'company_name', 'company_raw', 'company_is_staffing', 'education_levels', 'education_levels_name', 'min_edu_levels', 'min_edu_levels_name', 'employment_type', 'employment_type_name', 'min_years_experience', 'is_internship', 'remote_type', 'remote_type_name', 'location', 'city', 'city_name', 'county', 'county_name', 'msa', 'msa_name', 'state', 'state_name', 'county_outgoing', 'county_name_outgoing', 'county_incoming', 'county_name_incoming', 'msa_outgoing', 'msa_name_outgoing', 'msa_incoming', 'msa_name_incoming', 'naics2_name', 'naics3_name', 'naics4_name', 'naics5_name', 'naics6_name', 'title', 'title_name', 'title_clean', 'certifications', 'certifications_name', 'onet', 'onet_name', 'onet_2019', 'onet_2019_name', 'cip6', 'cip6_name', 'cip4', 'cip4_name', 'cip2', 'cip2_name', 'soc_2021_2', 'soc_2021_2_name', 'soc_2021_3', 'soc_2021_3_name', 'soc_2021_4', 'soc_2021_4_name', 'soc_2021_5', 'soc_2021_5_name', 'lot_career_area', 'lot_career_area_name', 'lot_occupation', 'lot_occupation_name', 'lot_specialized_occupation', 'lot_specialized_occupation_name', 'lot_occupation_group', 'lot_occupation_group_name', 'lot_v6_specialized_occupation', 'lot_v6_specialized_occupation_name', 'lot_v6_occupation', 'lot_v6_occupation_name', 'lot_v6_occupation_group', 'lot_v6_occupation_group_name', 'lot_v6_career_area', 'lot_v6_career_area_name', 'soc_2_name', 'soc_3_name', 'soc_4', 'soc_4_name', 'soc_5_name', 'naics_2022_2', 'naics_2022_2_name', 'naics_2022_3', 'naics_2022_3_name', 'naics_2022_4', 'naics_2022_4_name', 'naics_2022_5', 'naics_2022_5_name', 'naics_2022_6', 'naics_2022_6_name', 'salary']
Before removing duplicates:
                title     company  \
0  ET29C073C03D1F86B4    894731.0   
1  ET21DDA63780A7DC09    133098.0   
2  ET3037E0C947A02404  39063746.0   
3  ET2114E0404BA30075  37615159.0   
4  ET0000000000000000         0.0   

                                            location    posted  
0     {\n  "lat": 33.20763,\n  "lon": -92.6662674\n}  6/2/2024  
1   {\n  "lat": 44.3106241,\n  "lon": -69.7794897\n}  6/2/2024  
2   {\n  "lat": 32.7766642,\n  "lon": -96.7969879\n}  6/2/2024  
3  {\n  "lat": 33.4483771,\n  "lon": -112.0740373\n}  6/2/2024  
4  {\n  "lat": 37.6392595,\n  "lon": -120.9970014\n}  6/2/2024  

After removing duplicates:
                title     company  \
0  ET29C073C03D1F86B4    894731.0   
1  ET21DDA63780A7DC09    133098.0   
2  ET3037E0C947A02404  39063746.0   
3  ET2114E0404BA30075  37615159.0   
4  ET0000000000000000         0.0   

                                            location    posted  
0     {\n  "lat": 33.20763,\n  "lon": -92.6662674\n}  6/2/2024  
1   {\n  "lat": 44.3106241,\n  "lon": -69.7794897\n}  6/2/2024  
2   {\n  "lat": 32.7766642,\n  "lon": -96.7969879\n}  6/2/2024  
3  {\n  "lat": 33.4483771,\n  "lon": -112.0740373\n}  6/2/2024  
4  {\n  "lat": 37.6392595,\n  "lon": -120.9970014\n}  6/2/2024  

Duplicates removed based on: ['title', 'company', 'location', 'posted']
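As a sanity check, the number of rows removed can be made explicit with `duplicated()`; a minimal sketch on a toy frame (the values below are illustrative, not from the Lightcast data):

```python
import pandas as pd

# Toy postings frame (hypothetical values) with one exact repeat
# on the de-duplication keys used above.
df = pd.DataFrame({
    "title":    ["ET1", "ET2", "ET1"],
    "company":  [100.0, 200.0, 100.0],
    "location": ["{lat: 1}", "{lat: 2}", "{lat: 1}"],
    "posted":   ["6/2/2024", "6/2/2024", "6/2/2024"],
})

keys = ["title", "company", "location", "posted"]

# Count duplicate rows before dropping, then verify none remain
n_dupes = df.duplicated(subset=keys).sum()
df = df.drop_duplicates(subset=keys, keep="first")

print(f"Removed {n_dupes} duplicate rows; {len(df)} rows remain.")
```
Reporting the count alongside the before/after previews makes it obvious how many postings were actually collapsed.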

Exploratory Data Analysis (EDA)

EDA helps uncover patterns in job postings and salaries across industries. These insights assist job seekers in making informed career decisions.

Job Postings by Industry

Why this visualization?

This bar chart helps identify which industries have the highest number of job postings. It provides insights into industry demand, helping job seekers target sectors with more opportunities.

import plotly.express as px
import plotly.io as pio

# Set Plotly renderer for Quarto or Jupyter
pio.renderers.default = "notebook"

# Get top 20 industries by job postings
top_n = 20
industry_counts = df["naics6_name"].value_counts().nlargest(top_n).reset_index()
industry_counts.columns = ["Industry", "Count"]

# Create horizontal bar chart with a taller y-axis
fig = px.bar(
    industry_counts,
    x="Industry",
    y="Count",
    title=f"Top {top_n} Job Postings by Industry (NAICS6)",
    labels={"Industry": "Industry", "Count": "Number of Job Postings"}
)

# Extend y-axis and increase figure height
fig.update_layout(
    xaxis_title="Industry",
    yaxis_title="Number of Job Postings",
    yaxis=dict(range=[0, industry_counts["Count"].max() * 1.2]),  # Extend y-axis
    height=1000  # Increase figure height for better spacing
)

fig.show()

Insights

Setting aside Unclassified Industry, which actually has the highest count and likely reflects miscategorized postings or sectors not yet classified, Custom Computer Programming Services, Administrative Management, and Employment Placement Agencies lead in job postings, indicating strong demand in tech, consulting, and staffing. Computer Systems Design and Commercial Banking also show substantial availability, reflecting growth in IT and finance, while industries such as Health Insurance and Educational Services show moderate demand.
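The share of unclassified postings behind this caveat can be quantified directly; a minimal sketch on toy industry labels (illustrative counts, not the real dataset):

```python
import pandas as pd

# Toy industry labels (hypothetical) standing in for df["naics6_name"]
industries = pd.Series([
    "Unclassified Industry", "Custom Computer Programming Services",
    "Custom Computer Programming Services", "Commercial Banking",
    "Unclassified Industry", "Employment Placement Agencies",
])

# Posting counts per industry and the unclassified fraction
counts = industries.value_counts()
share_unclassified = counts.get("Unclassified Industry", 0) / len(industries)

print(counts.head(3))
print(f"Unclassified share: {share_unclassified:.1%}")
```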

Salary Distribution by Industry

Why this visualization?

This box plot is used to analyze salary distribution across the top 20 industries. It helps compare median salaries, salary variability, and outliers, which is crucial for understanding income potential in different fields.

import plotly.express as px

# Get top 20 industries by job postings
top_n = 20
top_industries = df["naics6_name"].value_counts().nlargest(top_n).index

# Filter dataset for top industries
df_filtered = df[df["naics6_name"].isin(top_industries)]

# Create the box plot with an extended y-axis
fig = px.box(
    df_filtered,
    x="naics6_name",
    y="salary",
    title=f"Salary Distribution in Top {top_n} Industries",
    labels={"naics6_name": "Industry", "salary": "Salary ($)"},
    points="all"  # Show all outliers
)

# Extend the y-axis
fig.update_layout(
    xaxis_title="Industry",
    yaxis_title="Salary ($)",
    yaxis=dict(range=[0, df_filtered["salary"].max() * 1.2]),  # Extend y-axis 20% above max salary
    height=1000  # Increase figure height for better visibility
)

fig.show()

Insights

Commercial Banking and Tech-related industries show wide salary ranges, indicating opportunities for growth. Temporary Help Services has the lowest pay, reflecting short-term or contract roles. Tech and finance roles offer both high salaries and significant growth potential.
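The spread that the box plot shows visually can be backed with explicit medians and interquartile ranges; a minimal sketch on toy salaries (hypothetical industries and values):

```python
import pandas as pd

# Toy salary data (hypothetical) to compare medians and IQRs across
# industries, as the box plot does visually
df = pd.DataFrame({
    "naics6_name": ["Commercial Banking"] * 4 + ["Temporary Help Services"] * 4,
    "salary": [90_000, 120_000, 150_000, 200_000,
               40_000, 45_000, 50_000, 55_000],
})

# Named aggregations give one row per industry
stats = df.groupby("naics6_name")["salary"].agg(
    median="median",
    q1=lambda s: s.quantile(0.25),
    q3=lambda s: s.quantile(0.75),
)
stats["iqr"] = stats["q3"] - stats["q1"]
print(stats)
```
A wide IQR, as in the banking rows here, is the numeric counterpart of the "wide salary ranges" the box plot suggests.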

Remote vs. On-Site Jobs

Why this visualization?

This pie chart compares the distribution of remote, hybrid, and on-site jobs, showing workplace flexibility trends. It helps job seekers understand how common remote opportunities are in the current job market.

fig = px.pie(df, names="remote_type_name", title="Remote vs. On-Site Jobs")
fig.show()

Insights

A majority of postings (~78.3%) carry no explicit remote classification, which may indicate missing or unspecified remote-work details. Only 17% of jobs are fully remote, so while remote work exists, it is not yet dominant in most industries. Hybrid roles (3.11%) are emerging but remain a small share, suggesting a slow transition toward flexible work models. Explicitly on-site postings ("Not Remote", 1.58%) reinforce that many industries still require physical presence. Remote opportunities exist but are limited, so job seekers should target specific industries or roles for remote work.
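The pie-chart percentages can be reproduced numerically with a normalized `value_counts`; a minimal sketch on toy labels (counts chosen only to roughly mirror the shares above):

```python
import pandas as pd

# Toy remote-type labels (hypothetical counts) standing in for
# df["remote_type_name"]
remote = pd.Series(
    ["[None]"] * 78 + ["Remote"] * 17 + ["Hybrid Remote"] * 3 + ["Not Remote"] * 2
)

# normalize=True converts counts to fractions of all postings
shares = remote.value_counts(normalize=True)
print(shares.round(3))
```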

Geographic Variation of Job Postings

Why this visualization?

This map visualizes the number of job postings by U.S. state, offering a clear look at where opportunities are most concentrated geographically. It helps job seekers understand which states have the highest job demand.

state_counts = df["state_name"].value_counts().reset_index()
state_counts.columns = ["State", "Job Postings"]

us_state_abbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR',
    'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE',
    'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID',
    'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS',
    'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV',
    'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM',
    'New York': 'NY', 'North Carolina': 'NC', 'North Dakota': 'ND',
    'Ohio': 'OH', 'Oklahoma': 'OK', 'Oregon': 'OR', 'Pennsylvania': 'PA',
    'Rhode Island': 'RI', 'South Carolina': 'SC', 'South Dakota': 'SD',
    'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT', 'Vermont': 'VT',
    'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV',
    'Wisconsin': 'WI', 'Wyoming': 'WY'
}

#   Add abbreviation column
state_counts["State Abbrev"] = state_counts["State"].map(us_state_abbrev)

#   Create a choropleth map
fig = px.choropleth(
    state_counts,
    locations="State Abbrev",
    locationmode="USA-states",
    color="Job Postings",
    color_continuous_scale="Blues",
    scope="usa",
    title="Job Postings by U.S. State"
)

fig.update_layout(
    geo=dict(bgcolor='rgba(0,0,0,0)'),
    height=600
)

fig.show()

Insights: Geographic Patterns in Job Postings (2024–2025)

Texas and California lead in job postings, reflecting strong economies and diverse industries (tech, energy, entertainment). Florida, New York, and Illinois also show significant demand, driven by strong finance, healthcare, and logistics sectors. Southeastern and Midwestern states such as North Carolina, Georgia, and Ohio provide moderate opportunities, often with a lower cost of living. Low-posting states such as Wyoming, Montana, and Alaska may reflect narrower regional economies or smaller labor markets.

This distribution suggests job seekers may benefit from targeting high-posting states or considering relocation for greater opportunity.
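One caveat about the mapping step above: any state_name not present in the 50-state us_state_abbrev dictionary (for example, District of Columbia or U.S. territories) maps to NaN and silently drops off the choropleth. A minimal sketch of how to surface those rows, using a two-entry excerpt of the dictionary and toy counts:

```python
import pandas as pd

# Two-entry excerpt of the mapping; "District of Columbia" is deliberately
# absent, mirroring the 50-state dictionary used above
us_state_abbrev = {"Texas": "TX", "California": "CA"}

state_counts = pd.DataFrame({
    "State": ["Texas", "California", "District of Columbia"],
    "Job Postings": [9000, 8500, 1200],  # hypothetical counts
})
state_counts["State Abbrev"] = state_counts["State"].map(us_state_abbrev)

# Rows that would vanish from the map
unmapped = state_counts[state_counts["State Abbrev"].isna()]
print(unmapped[["State", "Job Postings"]])
```
Checking for unmapped rows before plotting guards against quietly understating demand in uncovered regions.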

Job Postings Over Time

Why this visualization?

This time series plot shows how job demand has changed over time. It’s useful to spot trends, seasonal hiring spikes, or drops (e.g., holidays, recession).


df["posted_date"] = pd.to_datetime(df["posted"], errors="coerce")

# Group by month
monthly_postings = df.groupby(df["posted_date"].dt.to_period("M")).size().reset_index(name="Job Postings")
monthly_postings["Month"] = monthly_postings["posted_date"].dt.to_timestamp()


fig = px.line(
    monthly_postings,
    x="Month",
    y="Job Postings",
    title="Job Postings Over Time",
    labels={"Month": "Date", "Job Postings": "Number of Job Postings"}
)

fig.update_layout(height=500)
fig.show()

Insights

The line chart depicting job postings over time reveals a clear temporal trend in hiring activity throughout mid-2024. From early May to the end of June, there is a noticeable decline in the number of job postings, dropping from around 14,000 to approximately 12,200. This dip could be attributed to seasonal factors, such as the end of academic semesters, mid-year budget reviews, or general summer slowdowns in corporate recruitment cycles. However, beginning in early July, the job market shows a sharp rebound, with postings rising significantly through August, eventually stabilizing at around 14,700. This resurgence likely reflects renewed hiring efforts following budget resets or organizational planning periods. The uptick in late summer also aligns with typical Q3 hiring trends, where companies ramp up recruitment ahead of the final quarter. For job seekers, this pattern suggests that while opportunities may temporarily slow in early summer, late July through August presents a strong window for applications as employers actively seek talent.
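The dip-and-rebound described above can be quantified with month-over-month percentage changes; a minimal sketch on toy monthly counts (values approximate the figures read off the chart, not exact dataset totals):

```python
import pandas as pd

# Toy monthly posting counts (hypothetical) tracing the dip-then-rebound
# shape; pct_change quantifies the swing month over month
monthly = pd.Series(
    [14_000, 12_200, 13_500, 14_700],
    index=pd.period_range("2024-05", periods=4, freq="M"),
    name="Job Postings",
)

mom_change = monthly.pct_change().round(3)
print(mom_change)
```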

Top Job Titles by Frequency

Why this visualization?

This graph shows which roles are in highest demand — great for resume optimization and understanding what skills employers seek most often.


top_titles = df["title_name"].value_counts().nlargest(20).reset_index()
top_titles.columns = ["Job Title", "Count"]


fig = px.bar(
    top_titles,
    x="Count",
    y="Job Title",
    orientation="h",
    title="Top 20 Job Titles by Frequency",
    labels={"Count": "Number of Postings", "Job Title": "Job Title"},
    height=600
)

fig.update_layout(yaxis=dict(autorange="reversed"))  # highest at top
fig.show()

Insights

This bar chart provides a broader view of the top 20 job titles by frequency, highlighting dominant roles within the job market. “Data Analysts” clearly lead in demand, with over 8,000 postings — reaffirming the central role of data professionals in today’s workforce.

Other high-demand roles include “Business Intelligence Analysts,” “Enterprise Architects,” and “Oracle Cloud HCM Consultants,” which suggests a strong emphasis on both data strategy and cloud-based enterprise solutions. The appearance of niche titles like “SAP Consultants,” “Data Governance Analysts,” and “Data Quality Analysts” reflects organizations’ growing need for specialized expertise in maintaining and managing data infrastructure.

Overall, this visualization reinforces the importance of analytics, enterprise systems, and data architecture in the current job market.

Salary Distribution by Job Type (Remote vs. On-Site)

Why this visualization?

This box plot compares earning potential across work modes, which is important for evaluating the financial trade-offs between remote, hybrid, and on-site roles.


# Exclude missing salaries and extreme outliers above $300K
df_salary = df[df["salary"].notnull() & (df["salary"] < 300000)]

# Box plot: salary by remote_type_name
fig = px.box(
    df_salary,
    x="remote_type_name",
    y="salary",
    title="Salary Distribution by Job Type (Remote vs. On-Site)",
    labels={"remote_type_name": "Job Type", "salary": "Salary ($)"},
    points="all"
)

fig.update_layout(height=600)
fig.show()

Insights

This box plot presents the salary distribution across different job types, comparing remote, hybrid, on-site, and unclassified roles. A key takeaway is that salaries for remote and hybrid roles tend to be higher and more consistent than those for on-site positions. Remote jobs, in particular, show a tight interquartile range clustered around $110K–$125K, indicating a strong market demand and willingness to pay for location-flexible roles.

In contrast, on-site (“Not Remote”) jobs show a broader distribution, with a wider range of salaries and a lower median. This suggests more variability in pay, possibly due to a wider mix of roles (from entry-level to senior) or differing cost-of-living adjustments by region.

Overall, the visualization reinforces a trend seen across industries: remote and hybrid roles are not only desirable for flexibility but also competitive in compensation. Job seekers aiming for high-paying opportunities may benefit from targeting remote-friendly employers, especially in tech and analytics fields.
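The box-plot reading can be checked with explicit per-group medians and spreads; a minimal sketch on toy salaries (hypothetical values loosely echoing the ranges above):

```python
import pandas as pd

# Toy salaries by work mode (hypothetical) to back the box-plot reading
# with explicit medians and standard deviations
df = pd.DataFrame({
    "remote_type_name": ["Remote"] * 3 + ["Hybrid Remote"] * 3 + ["Not Remote"] * 3,
    "salary": [110_000, 118_000, 125_000,
               100_000, 115_000, 130_000,
               60_000, 95_000, 140_000],
})

summary = df.groupby("remote_type_name")["salary"].agg(["median", "std"]).round(0)
print(summary)
```
In this toy data, as in the chart, the on-site group pairs a lower median with a much wider spread than the remote group.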

Conclusion

The exploratory data analysis (EDA) provided a comprehensive overview of the 2024 job market landscape by examining job postings, salaries, and structural patterns across various dimensions.

  1. Top Industries by Postings: A bar chart of the most active industries revealed strong hiring demand in sectors like technology, consulting, and staffing. This highlights which fields are driving employment opportunities.

  2. Salary Distribution by Industry: A box plot compared salary ranges across top industries, showing clear disparities in compensation. Sectors like finance and tech offered both high pay and broad salary variability, while others showed lower, more consistent pay.

  3. Remote vs. On-Site Jobs: A pie chart illustrated that the majority of job postings lacked explicit remote classification, but among those that did, remote jobs were more prevalent than on-site roles. This reflects the growing demand for flexible work arrangements.

  4. Job Postings by U.S. State: A choropleth map identified geographic disparities in job availability, with states like Texas, California, and Florida leading in job volume. This helps job seekers target locations with strong hiring activity.

  5. Job Postings Over Time: A time series line chart revealed a seasonal trend: a dip in postings during early summer followed by a strong recovery in July and August. This suggests mid-year slowdowns and hiring rebounds tied to business cycles.

  6. Top Job Titles by Frequency: A horizontal bar chart showed that “Data Analyst” and related roles dominate the job market. The presence of various analyst titles reflects high demand for data-driven decision-making skills across industries.

  7. Salary Distribution by Job Type: A box plot comparing remote, hybrid, and on-site roles indicated that remote and hybrid jobs offer competitive, often higher, median salaries. This supports the idea that flexibility and compensation can go hand in hand.

Together, these seven visualizations provide a well-rounded understanding of the job market’s current state — from what roles are most in demand, where they are, how much they pay, and how work modes affect earning potential. This foundation supports deeper analysis and career strategy planning for job seekers and workforce analysts alike.